Data Report — CDC Diabetes Health Indicators

Source: UCI dataset 891

SemMap JSON-LD: dataset.semmap.json · RDFa HTML

Overview

Metric Value
Dataset CDC Diabetes Health Indicators
Source UCI dataset 891
Rows 253,680
Columns 22
Discrete 19
Continuous 3
SemMap SemMap JSON-LD · SemMap HTML
Missingness Not modeled

Variables and summary

variable inferred dist (discrete: label [code]: count (share); continuous: mean ± SD [min, Q1, median, Q3, max])
HighBP discrete Reported high blood pressure [1]: 108829 (42.90%)
HighChol discrete Reported high cholesterol [1]: 107591 (42.41%)
CholCheck discrete Cholesterol check in past five years [1]: 244210 (96.27%)
BMI continuous 28.3824 ± 6.6087 [12, 24, 27, 31, 98]
Smoker discrete At least 100 cigarettes ever [1]: 112423 (44.32%)
Stroke discrete Stroke diagnosis [1]: 10292 (4.06%)
HeartDiseaseorAttack discrete CHD or MI diagnosis [1]: 23893 (9.42%)
PhysActivity discrete Physical activity reported [1]: 191920 (75.65%)
Fruits discrete Daily fruit consumption [1]: 160898 (63.43%)
Veggies discrete Daily vegetable consumption [1]: 205841 (81.14%)
HvyAlcoholConsump discrete Alcohol consumption above heavy threshold [1]: 14256 (5.62%)
AnyHealthcare discrete Has health care coverage [1]: 241263 (95.11%)
NoDocbcCost discrete Cost prevented doctor visit [1]: 21354 (8.42%)
GenHlth discrete Very good health [2]: 89084 (35.12%)
                 Good health [3]: 75646 (29.82%)
                 Excellent health [1]: 45299 (17.86%)
                 Fair health [4]: 31570 (12.44%)
                 Poor health [5]: 12081 (4.76%)
MentHlth continuous 3.1848 ± 7.4128 [0, 0, 0, 2, 30]
PhysHlth continuous 4.2421 ± 8.7180 [0, 0, 0, 3, 30]
DiffWalk discrete Reported difficulty walking or climbing stairs [1]: 42675 (16.82%)
Sex discrete Male [1]: 111706 (44.03%)
Age discrete 60–64 years [9]: 33244 (13.10%)
             65–69 years [10]: 32194 (12.69%)
             55–59 years [8]: 30832 (12.15%)
             50–54 years [7]: 26314 (10.37%)
             70–74 years [11]: 23533 (9.28%)
             45–49 years [6]: 19819 (7.81%)
             80 years or older [13]: 17363 (6.84%)
             40–44 years [5]: 16157 (6.37%)
             75–79 years [12]: 15980 (6.30%)
             35–39 years [4]: 13823 (5.45%)
             … (+3 more)
Education discrete College 4 years or more (college graduate) [6]: 107325 (42.31%)
                   College 1–3 years (some college or technical school) [5]: 69910 (27.56%)
                   Grade 12 or GED (high school graduate) [4]: 62750 (24.74%)
                   Grades 9–11 (some high school) [3]: 9478 (3.74%)
                   Grades 1–8 (elementary) [2]: 4043 (1.59%)
                   Never attended school or only kindergarten [1]: 174 (0.07%)
Income discrete $75,000 or more [8]: 90385 (35.63%)
                $50,000 to <$75,000 [7]: 43219 (17.04%)
                $35,000 to <$50,000 [6]: 36470 (14.38%)
                $25,000 to <$35,000 [5]: 25883 (10.20%)
                $20,000 to <$25,000 [4]: 20135 (7.94%)
                $15,000 to <$20,000 [3]: 15994 (6.30%)
                $10,000 to <$15,000 [2]: 11783 (4.64%)
                Less than $10,000 [1]: 9811 (3.87%)
Diabetes_binary discrete Prediabetes or diabetes diagnosis [1]: 35346 (13.93%)
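
The summary rows above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the report generator's own code: the function names are hypothetical, the toy column is made up, and the choice of sample SD and inclusive quartiles is an assumption, since the report does not say which estimators it uses.

```python
import statistics
from collections import Counter

def summarise_discrete(values):
    """Return (level, count, share) tuples, most frequent level first."""
    counts = Counter(values)
    n = len(values)
    return [(level, c, c / n) for level, c in counts.most_common()]

def summarise_continuous(values):
    """Return mean, sample SD and [min, Q1, median, Q3, max]."""
    # Assumption: inclusive quartiles; the report may use another rule.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),
        "five_num": [min(values), q1, med, q3, max(values)],
    }

# Toy column for illustration only, not the real data.
toy_flag = [1, 0, 0, 1, 0, 0, 0, 1]
for level, count, share in summarise_discrete(toy_flag):
    print(f"[{level}]: {count} ({share:.2%})")

# A reported share can be re-derived from its count and the row total.
print(f"HighBP [1]: {108829 / 253680:.2%}")
```

The last line re-derives the HighBP share from the count and the 253,680 rows given in the overview, which is a quick consistency check on any row of the table.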

Fidelity summary

model backend disc_jsd_mean disc_jsd_median cont_ks_mean cont_w1_mean downstream_sign_match
metasyn metasyn 0.0288 0.0209 0.4691 2.2955 0.5185
clg_mi2 pybnesian 0.0229 0.0186 0.2691 2.9528 -
semi_mi5 pybnesian 0.0239 0.0157 0.2634 2.9669 -
ctgan_fast synthcity 0.2127 0.1539 0.802 7.7665 -
tvae_quick synthcity 0.0864 0.0658 0.3607 1.8369 -
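
For readers unfamiliar with the three distances in this table, the sketch below computes them from two samples using only the standard library. It is an illustration under stated assumptions: JSD is taken as the Jensen-Shannon divergence in base 2 (the report does not say whether it uses the divergence or its square root, or which log base), and the function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter

def jsd(real, synth, base=2.0):
    """Jensen-Shannon divergence between the level frequencies of two samples."""
    p_counts, q_counts = Counter(real), Counter(synth)
    total = 0.0
    for level in set(p_counts) | set(q_counts):
        p = p_counts[level] / len(real)
        q = q_counts[level] / len(synth)
        m = 0.5 * (p + q)
        if p > 0:
            total += 0.5 * p * math.log(p / m, base)
        if q > 0:
            total += 0.5 * q * math.log(q / m, base)
    return total

def _ecdf(sorted_values, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect_right(sorted_values, x) / len(sorted_values)

def ks_and_w1(real, synth):
    """Two-sample KS statistic and 1-Wasserstein distance from empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    grid = sorted(set(a) | set(b))
    gaps = [abs(_ecdf(a, x) - _ecdf(b, x)) for x in grid]
    ks = max(gaps)
    # W1 equals the area between the two step CDFs.
    w1 = sum(g * (x1 - x0) for g, x0, x1 in zip(gaps, grid, grid[1:]))
    return ks, w1

# Toy samples for illustration only.
print(jsd([0, 0, 1, 1], [0, 1, 1, 1]))
print(ks_and_w1([0, 1], [1, 2]))
```

KS is bounded by 1 while W1 is in the units of the variable, which is why the W1 column can take values well above 1 for a variable such as BMI.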

Privacy summary

model backend n_real n_synth exact_overlap_rate near_duplicate_rate_eps nn_distance_mean k_min k_pct_lt5 k_map rare_qi_reproduction_rate identifiability_score delta_presence
metasyn metasyn 253680 1000 0 0.876 0.1199 1 0.9939 1 0 - 79
clg_mi2 pybnesian 253680 1000 0 0.945 0.074 1 0.9939 2 0 - 14.5
semi_mi5 pybnesian 253680 1000 0 0.899 0.0977 1 0.9939 12 0 - 1.9333
ctgan_fast synthcity 253680 256 0 0.1367 0.3833 1 0.9939 5 0 - 3.8
tvae_quick synthcity 253680 256 0 0.9258 0.0772 1 0.9939 2 0 - 4
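
The k-anonymity-style columns and the exact overlap rate can be sketched as below. This is one plausible reading, not the report's definition: it assumes k_min is the smallest equivalence-class size over a set of quasi-identifier columns, that k_pct_lt5 is the share of records in classes smaller than 5, and that exact overlap means a synthetic row matches a real row on every column. The function names and the toy rows are hypothetical.

```python
from collections import Counter

def equivalence_class_stats(rows, qi_columns, threshold=5):
    """Smallest class size and share of records in classes below threshold."""
    keys = [tuple(row[c] for c in qi_columns) for row in rows]
    sizes = Counter(keys)
    k_min = min(sizes.values())
    small = sum(1 for key in keys if sizes[key] < threshold)
    return k_min, small / len(keys)

def exact_overlap_rate(real_rows, synth_rows, columns):
    """Share of synthetic rows that match some real row on every column."""
    real_keys = {tuple(row[c] for c in columns) for row in real_rows}
    hits = sum(tuple(row[c] for c in columns) in real_keys for row in synth_rows)
    return hits / len(synth_rows)

# Toy rows for illustration only.
real = [{"Sex": 1, "Age": 9}, {"Sex": 1, "Age": 9}, {"Sex": 0, "Age": 9}]
synth = [{"Sex": 0, "Age": 9}, {"Sex": 0, "Age": 1}]
print(equivalence_class_stats(real, ["Sex", "Age"]))
print(exact_overlap_rate(real, synth, ["Sex", "Age"]))
```

In the table, k_min and k_pct_lt5 are identical for all five models, which suggests they are computed on the real data alone rather than on each synthetic sample.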

Models

Real data

Model: metasyn (metasyn)

Per-variable fidelity
variable type KS W1 JSD
HighBP discrete - - 0.0346
HighChol discrete - - 0.0356
CholCheck discrete - - 0.0302
BMI continuous 0.084 0.7286 -
Smoker discrete - - 0.0053
Stroke discrete - - 0.032
HeartDiseaseorAttack discrete - - 0.0198
PhysActivity discrete - - 0.0094
Fruits discrete - - 0.0212
Veggies discrete - - 0.0017
Downstream metrics
metric value
sign_match_rate 0.5185
formula Diabetes_binary ~ Q('HighBP') + Q('HighChol') + Q('CholCheck') + Q('BMI') + Q('Smoker') + Q('Stroke') + Q('HeartDiseaseorAttack') + Q('PhysActivity') + Q('Fruits') + Q('Veggies') + Q('HvyAlcoholConsump') + Q('AnyHealthcare') + Q('NoDocbcCost') + Q('GenHlth') + Q('MentHlth') + Q('PhysHlth') + Q('DiffWalk') + Q('Sex') + Q('Age') + Q('Education') + Q('Income') + Q('HighBP'):Q('HighChol') + Q('HighChol'):Q('CholCheck') + Q('CholCheck'):Q('BMI') + Q('BMI'):Q('Smoker') + Q('Smoker'):Q('Stroke')
skipped_reason
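
A sign match rate of this kind is usually the share of regression coefficients whose sign agrees between a model fitted on the real data and the same model fitted on the synthetic data. The sketch below assumes that definition; the function names are hypothetical and the coefficient values are toy numbers, not fitted results.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_match_rate(real_coefs, synth_coefs):
    """Share of shared coefficients with the same sign in both fits."""
    shared = [name for name in real_coefs if name in synth_coefs]
    matches = sum(sign(real_coefs[n]) == sign(synth_coefs[n]) for n in shared)
    return matches / len(shared)

# Toy coefficients for illustration only.
real = {"Intercept": -1.0, "x1": 0.5, "x2": 0.2, "x3": -0.1}
synth = {"Intercept": -2.0, "x1": -0.3, "x2": 0.1, "x3": 0.4}
print(sign_match_rate(real, synth))

# The formula above has 21 main effects and 5 interactions; with an
# intercept that is 27 coefficients, and 14 matches out of 27 rounds to 0.5185.
print(round(14 / 27, 4))
```

The reported 0.5185 is consistent with 14 of 27 coefficients agreeing in sign, though the report does not state the counts, so this is an inference from the formula rather than a documented figure.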
Privacy metrics
metric value
n_real 253680
n_synth 1000
exact_overlap_rate 0
near_duplicate_rate_eps 0.876
nn_distance_mean 0.1199
k_min 1
k_pct_lt5 0.9939
k_map 1
rare_qi_reproduction_rate 0
delta_presence 79
variable distribution
HighBP core.multinoulli
HighChol core.multinoulli
CholCheck core.multinoulli
BMI core.lognormal
Smoker core.multinoulli
Stroke core.multinoulli
HeartDiseaseorAttack core.multinoulli
PhysActivity core.multinoulli
Fruits core.multinoulli
Veggies core.multinoulli
HvyAlcoholConsump core.multinoulli
AnyHealthcare core.multinoulli
NoDocbcCost core.multinoulli
GenHlth core.multinoulli
MentHlth core.truncated_normal
PhysHlth core.truncated_normal
DiffWalk core.multinoulli
Sex core.multinoulli
Age core.multinoulli
Education core.multinoulli
Income core.multinoulli
Diabetes_binary core.multinoulli
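
The table above lists one fitted distribution per variable, which suggests that each column is generated from its own marginal. The sketch below illustrates that idea for two multinoulli columns with the standard library. It is not metasyn's API: the function name is hypothetical, the level-1 shares are taken from the variable summary, and the level-0 shares are assumed to be their complements.

```python
import random

def sample_independent(marginals, n, seed=0):
    """Draw n rows, sampling every column independently from its marginal."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, probs in marginals.items():
            levels, weights = zip(*probs.items())
            row[name] = rng.choices(levels, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Level-1 shares from the variable summary; level 0 assumed to be the rest.
marginals = {
    "HighBP": {0: 1 - 0.4290, 1: 0.4290},
    "HighChol": {0: 1 - 0.4241, 1: 0.4241},
}
rows = sample_independent(marginals, n=1000)
share = sum(r["HighBP"] for r in rows) / len(rows)
print(f"synthetic HighBP share: {share:.3f}")
```

Independent sampling reproduces each marginal but carries no association between columns, which would be consistent with low per-variable JSD alongside a downstream sign match rate close to one half.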

Model: clg_mi2 (pybnesian)

Per-variable fidelity
variable type KS W1 JSD
HighBP discrete - - 0.0233
HighChol discrete - - 0.0199
CholCheck discrete - - 0.0037
BMI continuous 0.1348 1.2745 -
Smoker discrete - - 0.0169
Stroke discrete - - 0.0373
HeartDiseaseorAttack discrete - - 0.0229
PhysActivity discrete - - 0.0165
Fruits discrete - - 0.0186
Veggies discrete - - 0.0072
Privacy metrics
metric value
n_real 253680
n_synth 1000
exact_overlap_rate 0
near_duplicate_rate_eps 0.945
nn_distance_mean 0.074
k_min 1
k_pct_lt5 0.9939
k_map 2
rare_qi_reproduction_rate 0
delta_presence 14.5
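
The two distance-based rows, nn_distance_mean and near_duplicate_rate_eps, can be read as nearest-neighbour statistics from each synthetic row to the real data. The sketch below assumes Euclidean distance on numeric rows and a user-chosen eps; the report states neither the metric, the scaling, nor the eps value, and the function names are hypothetical.

```python
import math

def nearest_distances(real_rows, synth_rows):
    """Distance from each synthetic row to its closest real row (brute force)."""
    return [min(math.dist(s, r) for r in real_rows) for s in synth_rows]

def nn_summary(real_rows, synth_rows, eps):
    """Mean nearest-neighbour distance and share of rows within eps."""
    d = nearest_distances(real_rows, synth_rows)
    return {
        "nn_distance_mean": sum(d) / len(d),
        "near_duplicate_rate_eps": sum(x <= eps for x in d) / len(d),
    }

# Toy points for illustration only.
real = [(0.0, 0.0), (1.0, 1.0)]
synth = [(0.0, 0.1), (1.0, 1.0), (3.0, 1.0)]
print(nn_summary(real, synth, eps=0.5))
```

The brute-force loop is quadratic in the number of rows; with 253,680 real rows a spatial index or a batched array computation would be the practical choice.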

Model: semi_mi5 (pybnesian)

Per-variable fidelity
variable type KS W1 JSD
HighBP discrete - - 0.0095
HighChol discrete - - 0.0426
CholCheck discrete - - 0.0053
BMI continuous 0.0958 1.066 -
Smoker discrete - - 0.0044
Stroke discrete - - 0.0219
HeartDiseaseorAttack discrete - - 0.0229
PhysActivity discrete - - 0.0025
Fruits discrete - - 0.0139
Veggies discrete - - 0.0309
Privacy metrics
metric value
n_real 253680
n_synth 1000
exact_overlap_rate 0
near_duplicate_rate_eps 0.899
nn_distance_mean 0.0977
k_min 1
k_pct_lt5 0.9939
k_map 12
rare_qi_reproduction_rate 0
delta_presence 1.9333

Model: ctgan_fast (synthcity)

Per-variable fidelity
variable type KS W1 JSD
HighBP discrete - - 0.2
HighChol discrete - - 0.1011
CholCheck discrete - - 0.0582
BMI continuous 0.9613 15.7567 -
Smoker discrete - - 0.083
Stroke discrete - - 0.0789
HeartDiseaseorAttack discrete - - 0.1539
PhysActivity discrete - - 0.3231
Fruits discrete - - 0.0336
Veggies discrete - - 0.0846
Privacy metrics
metric value
n_real 253680
n_synth 256
exact_overlap_rate 0
near_duplicate_rate_eps 0.1367
nn_distance_mean 0.3833
k_min 1
k_pct_lt5 0.9939
k_map 5
rare_qi_reproduction_rate 0
delta_presence 3.8

Model: tvae_quick (synthcity)

Per-variable fidelity
variable type KS W1 JSD
HighBP discrete - - 0.0605
HighChol discrete - - 0.0984
CholCheck discrete - - 0.0122
BMI continuous 0.2469 2.4764 -
Smoker discrete - - 0.0149
Stroke discrete - - 0.0529
HeartDiseaseorAttack discrete - - 0.0571
PhysActivity discrete - - 0.0839
Fruits discrete - - 0.0658
Veggies discrete - - 0.1223
Privacy metrics
metric value
n_real 253680
n_synth 256
exact_overlap_rate 0
near_duplicate_rate_eps 0.9258
nn_distance_mean 0.0772
k_min 1
k_pct_lt5 0.9939
k_map 2
rare_qi_reproduction_rate 0
delta_presence 4